{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAG for Table Comparisons with LlamaParse + LlamaIndex\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_table_comparisons.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "This notebook shows how to compare both tabular and text data across multiple PDF documents.\n",
    "\n",
    "We load multiple PDFs with embedded tables (Apple's 2021 and 2020 10-K filings) using LlamaParse, parse each into a hierarchy of table and text objects, define a recursive retriever over each, and then compose both with a `SubQuestionQueryEngine`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Install core packages, download files, parse documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index\n",
    "!pip install llama-index-core\n",
    "!pip install llama-index-embeddings-openai\n",
    "!pip install llama-index-question-gen-openai\n",
    "!pip install llama-index-postprocessor-flag-embedding-reranker\n",
    "!pip install git+https://github.com/FlagOpen/FlagEmbedding.git\n",
    "!pip install llama-parse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget \"https://s2.q4cdn.com/470004039/files/doc_financials/2020/ar/_10-K-2020-(As-Filed).pdf\" -O apple_2020_10k.pdf\n",
    "!wget \"https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf\" -O apple_2021_10k.pdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the API keys for LlamaParse (LlamaCloud) and OpenAI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# llama-parse is async-first; running async code in a notebook requires nest_asyncio\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "import os\n",
    "# API access to llama-cloud\n",
    "os.environ[\"LLAMA_CLOUD_API_KEY\"] = \"llx-\"\n",
    "\n",
    "# Using OpenAI API for embeddings/LLMs\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core import Settings\n",
    "\n",
    "embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\")\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo-0125\")\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the `LlamaParse` PDF Reader for PDF Parsing\n",
    "\n",
    "`LlamaParse` converts each PDF to Markdown. There are two retrieval/query engine strategies for working with that output:\n",
    "1. Use the raw Markdown text as nodes to build an index, then apply a simple query engine to generate results.\n",
    "2. Use `MarkdownElementNodeParser` to parse the `LlamaParse` Markdown output, then build a recursive retriever query engine for generation.\n",
    "\n",
    "This notebook uses the second strategy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "# one parser instance, reused for both filings\n",
    "parser = LlamaParse(result_type=\"markdown\")\n",
    "\n",
    "docs_2021 = parser.load_data(\"./apple_2021_10k.pdf\")\n",
    "docs_2020 = parser.load_data(\"./apple_2020_10k.pdf\")"
   ]
  },
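  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before building retrievers, it can help to check what the parser produced. The cell below is a minimal sketch that prints the start of the first parsed document for each filing. It assumes `load_data` returned at least one document per file and that each document exposes its Markdown through `.text`; the 500-character preview length is an arbitrary choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preview the parsed Markdown (assumes at least one document per filing)\n",
    "for year, parsed_docs in [(\"2021\", docs_2021), (\"2020\", docs_2020)]:\n",
    "    print(f\"--- {year}: {len(parsed_docs)} document(s) ---\")\n",
    "    # show only the first 500 characters of the first document\n",
    "    print(parsed_docs[0].text[:500])"
   ]
  },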
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Recursive Retriever over each Document\n",
    "\n",
    "We define a function that builds a recursive retriever over each document. The steps are:\n",
    "- Hierarchically parse the document with `MarkdownElementNodeParser`, which summarizes and embeds the embedded tables.\n",
    "- Load the nodes into a vector store. Under the hood, links between nodes (e.g. table summary to table text) are stored automatically.\n",
    "- Get a query engine over the vector store, which performs retrieval and synthesis. Under the hood, recursive retrieval is performed automatically wherever such links exist."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.node_parser import MarkdownElementNodeParser\n",
    "\n",
    "node_parser = MarkdownElementNodeParser(llm=OpenAI(model=\"gpt-3.5-turbo-0125\"), num_workers=8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker\n",
    "\n",
    "reranker = FlagEmbeddingReranker(\n",
    "    top_n=5,\n",
    "    model=\"BAAI/bge-reranker-large\",\n",
    ")\n",
    "\n",
    "def create_query_engine_over_doc(docs, nodes_save_path=None):\n",
    "    \"\"\"Go from parsed documents to a recursive-retrieval query engine.\n",
    "\n",
    "    Parsed nodes are cached at `nodes_save_path` (if given) to avoid re-parsing.\n",
    "    Returns the query engine and the base nodes.\n",
    "    \"\"\"\n",
    "    if nodes_save_path is not None and os.path.exists(nodes_save_path):\n",
    "        # reuse previously parsed nodes\n",
    "        with open(nodes_save_path, \"rb\") as f:\n",
    "            raw_nodes = pickle.load(f)\n",
    "    else:\n",
    "        raw_nodes = node_parser.get_nodes_from_documents(docs)\n",
    "        if nodes_save_path is not None:\n",
    "            with open(nodes_save_path, \"wb\") as f:\n",
    "                pickle.dump(raw_nodes, f)\n",
    "\n",
    "    # separate the base nodes from the table objects\n",
    "    base_nodes, objects = node_parser.get_nodes_and_objects(raw_nodes)\n",
    "\n",
    "    # construct top-level vector index + query engine:\n",
    "    # retrieve 15 candidates, then let the reranker keep its top_n\n",
    "    vector_index = VectorStoreIndex(nodes=base_nodes + objects)\n",
    "    query_engine = vector_index.as_query_engine(\n",
    "        similarity_top_k=15,\n",
    "        node_postprocessors=[reranker]\n",
    "    )\n",
    "    return query_engine, base_nodes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_engine_2021, nodes_2021 = create_query_engine_over_doc(\n",
    "    docs_2021, nodes_save_path=\"2021_nodes.pkl\"\n",
    ")\n",
    "query_engine_2020, nodes_2020 = create_query_engine_over_doc(\n",
    "    docs_2020, nodes_save_path=\"2020_nodes.pkl\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from llama_index.core.tools import QueryEngineTool, ToolMetadata\n",
    "from llama_index.core.query_engine import SubQuestionQueryEngine\n",
    "\n",
    "\n",
    "# set up each per-year query engine as a tool\n",
    "query_engine_tools = [\n",
    "    QueryEngineTool(\n",
    "        query_engine=query_engine_2021,\n",
    "        metadata=ToolMetadata(\n",
    "            name=\"apple_2021_10k\",\n",
    "            description=(\n",
    "                \"Provides information about Apple financials for year 2021\"\n",
    "            ),\n",
    "        ),\n",
    "    ),\n",
    "    QueryEngineTool(\n",
    "        query_engine=query_engine_2020,\n",
    "        metadata=ToolMetadata(\n",
    "            name=\"apple_2020_10k\",\n",
    "            description=(\n",
    "                \"Provides information about Apple financials for year 2020\"\n",
    "            ),\n",
    "        ),\n",
    "    ),\n",
    "]\n",
    "\n",
    "sub_query_engine = SubQuestionQueryEngine.from_defaults(\n",
    "    query_engine_tools=query_engine_tools,\n",
    "    llm=llm,\n",
    "    use_async=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Try Out Some Comparisons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 4 sub questions.\n",
      "\u001b[1;3;38;2;237;90;200m[apple_2021_10k] Q: What are the deferred assets in 2021?\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2021_10k] Q: What are the deferred liabilities in 2021?\n",
      "\u001b[0m\u001b[1;3;38;2;11;159;203m[apple_2020_10k] Q: What are the deferred assets in 2020?\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m[apple_2020_10k] Q: What are the deferred liabilities in 2020?\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2021_10k] A: $7,200\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m[apple_2020_10k] A: $10,138\n",
      "\u001b[0m\u001b[1;3;38;2;237;90;200m[apple_2021_10k] A: $25,176 million\n",
      "\u001b[0m\u001b[1;3;38;2;11;159;203m[apple_2020_10k] A: $19,336\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = sub_query_engine.query(\n",
    "    \"Can you compare and contrast the deferred assets and liabilities in 2021 with 2020?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In 2021, the deferred assets increased by $5,840 million compared to 2020, while the deferred liabilities decreased by $2,938 million in the same period.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  },
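  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The synthesized differences can be checked against the sub-question answers logged by the query above. The cell below is a minimal sketch of that check: the four figures are copied by hand from the log, the variable names are illustrative, and it assumes all four answers are in millions of dollars (the log states the unit for only one of them). The results, an increase of 5,840 and a decrease of 2,938, agree with the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Figures copied by hand from the sub-question answers (assumed to be in millions of dollars)\n",
    "deferred_assets = {\"2021\": 25_176, \"2020\": 19_336}\n",
    "deferred_liabilities = {\"2021\": 7_200, \"2020\": 10_138}\n",
    "\n",
    "# year-over-year change, 2021 minus 2020\n",
    "asset_change = deferred_assets[\"2021\"] - deferred_assets[\"2020\"]\n",
    "liability_change = deferred_liabilities[\"2021\"] - deferred_liabilities[\"2020\"]\n",
    "\n",
    "print(f\"Deferred assets change: {asset_change:+,}\")\n",
    "print(f\"Deferred liabilities change: {liability_change:+,}\")"
   ]
  },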
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 2 sub questions.\n",
      "\u001b[1;3;38;2;237;90;200m[apple_2021_10k] Q: What is the total number of RSUs in Apple's 2021 financials?\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2020_10k] Q: What is the total number of RSUs in Apple's 2020 financials?\n",
      "\u001b[0m\u001b[1;3;38;2;237;90;200m[apple_2021_10k] A: The total number of RSUs in Apple's 2021 financials is 240,427.\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2020_10k] A: The total number of RSUs in Apple's 2020 financials is 310,778.\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = sub_query_engine.query(\n",
    "    \"Can you compare and contrast the total number of RSUs in 2021 and 2020?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 2 sub questions.\n",
      "\u001b[1;3;38;2;237;90;200m[apple_2021_10k] Q: What are the risk factors mentioned in the 2021 financial report of Apple?\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2020_10k] Q: What are the risk factors mentioned in the 2020 financial report of Apple?\n",
      "\u001b[0m\u001b[1;3;38;2;237;90;200m[apple_2021_10k] A: The risk factors mentioned in the 2021 financial report of Apple include risks related to COVID-19, macroeconomic and industry risks, political events, trade and international disputes, natural disasters, public health issues, industrial accidents, credit risk, fluctuations in foreign currency exchange rates, changes in tax rates and legislation, volatility in the price of the company's stock, and exposure to legal proceedings and claims.\n",
      "\u001b[0m\u001b[1;3;38;2;90;149;237m[apple_2020_10k] A: The risk factors mentioned in the 2020 financial report of Apple include the impact of the COVID-19 pandemic on the company's business operations, financial condition, and stock price; global and regional economic conditions affecting demand for products and services; competition in global markets with rapid technological changes; potential disruptions in the supply chain due to industrial accidents or public health issues; information technology system failures or network disruptions affecting business operations; risks associated with confidential information security and potential unauthorized access; fluctuations in quarterly net sales and operating results due to various factors; stock price volatility impacting investor confidence and employee retention; financial performance risks related to changes in foreign currency exchange rates affecting sales and earnings.\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = sub_query_engine.query(\n",
    "    \"Can you compare and contrast the risk factors in 2021 vs. 2020?\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The risk factors mentioned in the 2021 financial report of Apple include risks related to COVID-19, macroeconomic and industry risks, political events, trade and international disputes, natural disasters, public health issues, industrial accidents, credit risk, fluctuations in foreign currency exchange rates, changes in tax rates and legislation, volatility in the price of the company's stock, and exposure to legal proceedings and claims. In contrast, the risk factors mentioned in the 2020 financial report of Apple focused more on the impact of the COVID-19 pandemic on the company's business operations, financial condition, and stock price; global and regional economic conditions affecting demand for products and services; competition in global markets with rapid technological changes; potential disruptions in the supply chain due to industrial accidents or public health issues; information technology system failures or network disruptions affecting business operations; risks associated with confidential information security and potential unauthorized access; fluctuations in quarterly net sales and operating results due to various factors; stock price volatility impacting investor confidence and employee retention; financial performance risks related to changes in foreign currency exchange rates affecting sales and earnings.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_parse",
   "language": "python",
   "name": "llama_parse"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
